Assignment 1: Introduction!

date()
## [1] "Tue Nov 29 02:56:42 2022"

Hello! My name is Ghaida, but you can also call me Gege. I am a first-year master’s student in the University of Helsinki’s Contemporary Societies program.

Click here to visit my Github page!
Click here to see my Learning Diary!


Assignment 2: Regression and Model Validation

date()
## [1] "Tue Nov 29 02:56:42 2022"

Reading the data

data <- read.csv(file = "learning2014.csv", sep=",", header = TRUE)

Examining the structure and dimension of the data

dim(data)
## [1] 166   7
str(data)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : chr  "F" "M" "F" "M" ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...

The data has 166 observations and 7 variables, which are:

Variable name Description
gender Gender of the student (character, F/M)
Age Age of the students in years (integer)
attitude The average score of the student’s attitude towards statistics (numeric)
deep The student’s average score on deep learning approach (numeric)
stra The student’s average score on strategic learning approach (numeric)
surf The student’s average score on surface learning approach (numeric)
Points The student’s exam points (integer)

Overview of the data

#accessing the libraries
library(ggplot2)
library(GGally)
#drawing scatter plot matrix
ggpairs(data, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))

The data for female students are shown in pink, while the data for male students are shown in green. The frequency graph shows that we have a lot more data from female students. The students’ age distribution is highly skewed, which tells us that most of the students are on the younger side, but there are a few much older students as well. We can see that the students’ attitude towards statistics is moderately positively correlated with their exam points, which is not surprising. An interesting correlation can be seen between the surface and deep learning scores. The two variables are negatively correlated, meaning that students who score higher on surface learning tend to score lower on deep learning, and vice versa. However, a moderately strong negative correlation is only observed in male students, not in female students. A similar pattern can also be observed between the surface learning score and the students’ attitude towards statistics.
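The group-wise correlations that ggpairs displays can also be computed directly with dplyr. A minimal sketch using the built-in mtcars data (grouping by transmission type am in place of gender); the same pattern applies to computing cor(surf, deep) by gender in learning2014:

```r
library(dplyr)

# correlation of two numeric variables within each group,
# analogous to computing cor(surf, deep) separately by gender
res <- mtcars %>%
  group_by(am) %>%
  summarise(r = cor(hp, mpg))
res
```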

summary(data)
##     gender               Age           attitude          deep      
##  Length:166         Min.   :17.00   Min.   :1.400   Min.   :1.583  
##  Class :character   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333  
##  Mode  :character   Median :22.00   Median :3.200   Median :3.667  
##                     Mean   :25.51   Mean   :3.143   Mean   :3.680  
##                     3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083  
##                     Max.   :55.00   Max.   :5.000   Max.   :4.917  
##       stra            surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Here, we can see the descriptive statistics of each variable in the data. The students’ age ranges from 17 to 55 years, with an average of 25.5 years. The students’ attitude towards statistics averages 3.14, which means that the students have a slightly positive attitude towards statistics in general. Among the three learning approaches, deep learning has the highest average score while surface learning has the lowest. Meanwhile, the students’ exam points range from 7 to 33 and average 22.7.

Building the model

library(tidyverse)
#3 variables as explanatory variables
fit <- data %>%
  lm(Points ~ attitude + stra + deep , data = .)
summary(fit)
## 
## Call:
## lm(formula = Points ~ attitude + stra + deep, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.5239  -3.4276   0.5474   3.8220  11.5112 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.3915     3.4077   3.343  0.00103 ** 
## attitude      3.5254     0.5683   6.203 4.44e-09 ***
## stra          0.9621     0.5367   1.793  0.07489 .  
## deep         -0.7492     0.7507  -0.998  0.31974    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared:  0.2097, Adjusted R-squared:  0.195 
## F-statistic: 14.33 on 3 and 162 DF,  p-value: 2.521e-08

The low p-value and positive estimate suggest that the students’ attitude towards statistics is significantly and positively related to their exam points. We also found weak evidence of a positive relationship between the students’ strategic learning score and their exam points (p-value = 0.075, significant at the 0.1 level). However, we did not find any evidence of a relationship between the students’ deep learning score and their exam points. Next, we will remove this variable and fit the model again without it.

#remove insignificant variable
fit2 <- data %>%
  lm(Points ~ attitude + stra, data = .)
summary(fit2)
## 
## Call:
## lm(formula = Points ~ attitude + stra, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.6436  -3.3113   0.5575   3.7928  10.9295 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.9729     2.3959   3.745  0.00025 ***
## attitude      3.4658     0.5652   6.132 6.31e-09 ***
## stra          0.9137     0.5345   1.709  0.08927 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared:  0.2048, Adjusted R-squared:  0.1951 
## F-statistic: 20.99 on 2 and 163 DF,  p-value: 7.734e-09

We obtained a similar result to the previous model. With a very low p-value, the students’ attitude towards statistics was still found to be a highly significant predictor of their exam points. The students’ exam score increases by 3.46 points on average with each one-point increase in the students’ attitude score, assuming that their strategic learning score remains constant. Meanwhile, the students’ strategic learning score was found to be significant at the 0.1 level (p-value = 0.089). The students’ exam score increases by 0.91 points on average with each one-point increase in the students’ strategic learning score, assuming that their attitude score remains constant. In other words, students who have a more positive attitude towards statistics and/or practice a strategic learning approach tend to have higher exam points.
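As a quick sanity check on the coefficient interpretation, the fitted exam points for a hypothetical student can be computed by hand from the estimates above (the attitude and stra values here are made-up inputs near the sample medians):

```r
# coefficients taken from summary(fit2)
b0 <- 8.9729   # intercept
b1 <- 3.4658   # attitude
b2 <- 0.9137   # stra

# hypothetical student with attitude 3.2 and stra 3.19
predicted <- b0 + b1 * 3.2 + b2 * 3.19
round(predicted, 1)   # about 23 points, close to the observed median
```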

The adjusted R-squared of this model is 0.1951, which means that the model explains about 19.51 percent of the variation in the students’ exam points. Considering that this is a very simple model, this is a reasonably high value.
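The adjusted R-squared reported in the summary can be reproduced from the multiple R-squared, the sample size, and the number of predictors:

```r
r2 <- 0.2048   # multiple R-squared from summary(fit2)
n  <- 166      # number of observations
p  <- 2        # number of predictors (attitude, stra)

# adjusted R-squared penalizes the fit for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)   # ~0.195, matching the summary up to rounding
```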

Diagnostics

In linear regression, we assume a linear relationship between the variables and normally distributed error terms (residuals). We also assume that the variance of the residuals is equal across all predicted values (homoscedasticity). The residuals vs. fitted values plot and the Q-Q plot can be used to check these assumptions.

plot(fit2, which = c(1, 2, 5))

In the residuals vs. fitted values plot, we can see that the residuals seem to be randomly scattered. The plot does not display any concerning patterns, such as a curve (suggesting non-linearity) or a funnel shape (suggesting heteroscedasticity). Based on this plot, it seems that the relationship is linear and the residuals are homoscedastic (their variance tends to be equal across all predicted values).

The Q-Q plot compares the standardized residuals to their theoretical quantiles (the values they should have if the normality assumption holds). If the assumption is fulfilled, the points should fall along the straight line. In this plot, we can see that the points seem to form a slight curve, which means that the distribution of the residuals is actually a bit left-skewed.

The residuals vs. leverage plot is used to check whether there are any outliers that might affect the model heavily. There seem to be several extreme values in the data, but none of them fall outside the Cook’s distance lines, so they are not necessarily considered influential to the model.
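Influence can also be checked numerically instead of only visually. A sketch using a model fitted to the built-in mtcars data (the same calls would apply to fit2); the 4/n cutoff used here is just one common rule of thumb:

```r
# example model on built-in data; cooks.distance() works on any lm fit
m_ex <- lm(mpg ~ wt + hp, data = mtcars)
cd   <- cooks.distance(m_ex)

# observations exceeding the rule-of-thumb cutoff of 4/n
n <- nrow(mtcars)
names(cd[cd > 4 / n])
```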


Assignment 3: Logistic Regression

date()
## [1] "Tue Nov 29 02:56:53 2022"

Setting up

Reading the data

data <- read.csv(file = "alc.csv", sep=",", header = TRUE)

Explaining the data

#printing out the variables name
colnames(data)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "schoolsup" 
## [16] "famsup"     "activities" "nursery"    "higher"     "internet"  
## [21] "romantic"   "famrel"     "freetime"   "goout"      "Dalc"      
## [26] "Walc"       "health"     "failures"   "paid"       "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This week, we are working with the “Student Performance Data Set” from the UCI Machine Learning Repository. The data set was created to study student performance. It contains 35 variables on 370 students in two Portuguese high schools, collected from school reports and questionnaires. The variables in the data include:

Variable Name Description
school The student’s school (“GP” (Gabriel Pereira) or “MS” (Mousinho da Silveira))
sex The student’s sex (“F” or “M”)
age The student’s age (numeric from 15-22)
address The student’s home address type (“U” (urban) or “R” (rural))
famsize The student’s family size (“LE3” (<=3) or “GT3” (>3))
Pstatus The parent’s cohabitation status (“T” (living together) or “A” (apart))
Medu The mother’s education level (“0” (no education), “1” (up to 4th grade), “2” (5th to 9th grade), “3” (secondary education), or “4” (higher education))
Fedu The father’s education level (“0” (no education), “1” (up to 4th grade), “2” (5th to 9th grade), “3” (secondary education), or “4” (higher education))
Mjob The mother’s job (“teacher”, “health” related, civil “services” (e.g. administrative or police), “at_home”, or “other”)
Fjob The father’s job (“teacher”, “health” related, civil “services” (e.g. administrative or police), “at_home”, or “other”)
reason The reason for choosing the school (close to “home”, school “reputation”, “course” preference, or “other”)
guardian The student’s guardian (“mother”, “father”, or “other”)
traveltime Home to school travel time (“1” (<15 min), “2” (15 to 30 min), “3” (30 min. to 1 hour), or “4” (>1 hour))
studytime Time spent on studying weekly (“1” (<2 hours), “2” (2 to 5 hours), “3” (5 to 10 hours), or “4” (>10 hours))
failures Number of past class failures (numeric from 1-4 where 4 includes 4 or more classes)
schoolsup Extra educational support (“yes” or “no”)
famsup Family educational support (“yes” or “no”)
paid Extra paid classes within the course subject (“yes” or “no”)
activities Extracurricular activities (“yes” or “no”)
nursery Attended nursery school (“yes” or “no”)
higher Wants to take higher education (“yes” or “no”)
internet Availability of internet access at home (“yes” or “no”)
romantic In a romantic relationship (“yes” or “no”)
famrel Quality of family relationship (numeric from 1 (very bad) to 5 (excellent))
freetime Free time after school (numeric from 1 (very low) to 5 (very high))
goout Going out with friends (numeric from 1 (very low) to 5 (very high))
Dalc Workday alcohol consumption (numeric from 1 (very low) to 5 (very high))
Walc Weekend alcohol consumption (numeric from 1 (very low) to 5 (very high))
health Current health status (numeric from 1 (very bad) to 5 (very good))
absences Number of school absences (numeric from 0-93)
G1 First period grade (numeric from 0-20)
G2 Second period grade (numeric from 0-20)
G3 Final grade (numeric from 0-20)
alc_use Average alcohol use in both weekend (Walc) and weekdays (Dalc) (between 1-5)
high_use High use of alcohol (“TRUE” (alc_use > 2) or “FALSE” (alc_use <= 2))

Formulating the hypothesis

  1. Male students (sex) tend to have higher alcohol consumption (high_use)

  2. Students with more school absences (absences) would tend to have higher alcohol consumption (high_use)

  3. Students who spend less time studying (studytime) would tend to have higher alcohol consumption (high_use)

  4. Students who go out more (goout) would tend to have higher alcohol consumption (high_use)

Exploring the chosen variables

library(patchwork)
library(tidyverse)
library(finalfit)

Alcohol use and gender

data %>% 
  summary_factorlist(dependent   = "high_use", 
                     explanatory = "sex")
##  label levels      FALSE      TRUE
##    sex      F 154 (59.5) 41 (36.9)
##             M 105 (40.5) 70 (63.1)

The cross tabulation above reveals that among the 195 female students in the data, 41 have high alcohol consumption. Meanwhile, among the 175 male students, 70 have high alcohol consumption. Among students who have high alcohol consumption, 63.1 percent are male. We will next visualize this with plots.

#count plot
t1 <- data %>% 
  ggplot(aes(x = sex, fill = high_use)) + 
  geom_bar() + 
  theme(legend.position = "bottom")+
  labs(x = "Gender (sex)")
t1 <- t1 + scale_fill_discrete(name = "High use of alcohol")

#proportion plot
t2 <- data %>% 
  ggplot(aes(x = sex, fill = high_use)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")+ 
  theme(legend.position = "none")+
  labs(x = "Gender (sex)")
 
t1+t2

From the bar plot above, we can see that even though we have more female students in the data, the number of male students with high alcohol consumption is higher. This is more apparent in the proportion plot, where we can see that the proportion of students with high alcohol consumption is a lot higher in male students compared to female students.

Alcohol use and school absences

data %>% 
  summary_factorlist(dependent   = "high_use", 
                     explanatory = "absences")
##     label    levels     FALSE      TRUE
##  absences Mean (SD) 3.7 (4.5) 6.4 (7.1)

Students with high alcohol consumption have on average around 6.4 absences, while students with lower alcohol consumption have on average 3.7 absences. Next, we will also compare the distribution of each group with box plots.

data %>% 
  ggplot(aes(x = absences, y = high_use)) +
  geom_boxplot(aes(colour=high_use))+
  labs(x = "Number of school absences (absences)", 
       y = "High use of alcohol")+
  theme(legend.position = "none")

Students with high alcohol consumption have a higher median number of school absences compared to students with lower alcohol use. Both groups have quite skewed data, which means that most students have very few absences but some students have a very high number of absences.

Alcohol use and study time

For better presentation of the data, I will first recode the variable:

data <- data %>%
  mutate(studytime2 = studytime %>% factor() %>% 
           fct_recode("<2 hours" = "1", "2-5 hours" = "2", 
                      "5-10 hours" = "3", ">10 hours" = "4"))
#cross tabulation
data %>% 
  summary_factorlist(dependent   = "high_use", 
                     explanatory = "studytime2")
##       label     levels      FALSE      TRUE
##  studytime2   <2 hours  56 (21.6) 42 (37.8)
##              2-5 hours 128 (49.4) 57 (51.4)
##             5-10 hours  52 (20.1)   8 (7.2)
##              >10 hours   23 (8.9)   4 (3.6)

Among students who have high alcohol consumption, 51.4 percent spend 2-5 hours/week studying and 37.8 percent spend less than 2 hours. In total, 89.2 percent of students with high alcohol consumption spend 5 hours or less per week on studying, which is quite a high share. Next, we will visualize this with plots.
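The column percentages in the cross tabulation can be verified from the raw counts; the high-use column contains 111 students in total:

```r
# counts of high-use students by study time, from the table above
counts <- c("<2 hours" = 42, "2-5 hours" = 57,
            "5-10 hours" = 8, ">10 hours" = 4)

# column percentages, and the combined share studying 5 hours or less
pct <- round(100 * counts / sum(counts), 1)
pct
sum(pct[c("<2 hours", "2-5 hours")])   # 37.8 + 51.4 = 89.2
```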

#count plot
f1 <- data %>% 
  ggplot(aes(x = studytime2, fill = high_use)) + 
  geom_bar() + 
  theme(legend.position = "bottom")+
  labs(x = "Time spent on studying weekly")
f1 <- f1 + scale_fill_discrete(name = "High use of alcohol")

#proportion plot
f2 <- data %>% 
  ggplot(aes(x = studytime2, fill = high_use)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")+ 
  theme(legend.position = "none")+
  labs(x = "Time spent on studying weekly")
 
f1+f2

Most students spend 2-5 hours weekly on studying. This group also has the highest count of students with high alcohol consumption, but according to the proportion plot, students who spend less than 2 hours studying have the highest proportion of high alcohol consumption. Meanwhile, the proportion of students with high alcohol consumption is not very different between students who study 5-10 hours/week and those who study more than 10 hours/week.

Alcohol use and going out

For better presentation of the data, I will first recode the variable:

data <- data %>%
  mutate(goout2 = goout %>% factor() %>% 
           fct_recode("very rarely" = "1", "rarely" = "2", 
                      "medium" = "3", "often" = "4", "very often" = "5"))
#cross tabulation
data %>% 
  summary_factorlist(dependent   = "high_use", 
                     explanatory = "goout2")
##   label      levels     FALSE      TRUE
##  goout2 very rarely  19 (7.3)   3 (2.7)
##              rarely 82 (31.7) 15 (13.5)
##              medium 97 (37.5) 23 (20.7)
##               often 40 (15.4) 38 (34.2)
##          very often  21 (8.1) 32 (28.8)

Among students with high alcohol consumption, 34.2 percent go out often with their friends and 28.8 percent go out very often. From this table, we can already see that the proportion of students with high alcohol consumption is higher in groups of students who go out more often, but we will check this with plots.

#count plot
g1 <- data %>% 
  ggplot(aes(x = goout2, fill = high_use)) + 
  geom_bar() + 
  theme(legend.position = "bottom")+
  labs(x = "Going out with friends")
g1 <- g1 + scale_fill_discrete(name = "High use of alcohol")

#proportion plot
g2 <- data %>% 
  ggplot(aes(x = goout2, fill = high_use)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")+ 
  theme(legend.position = "none")+
  labs(x = "Going out with friends")
 
g1+g2

In the proportion plot, we can see that the proportion of students with high alcohol consumption gets higher in groups of students who go out more. The difference is subtle between the groups of students who go out less often, but in groups of students who go out often or very often, the proportion is very high.

Logistic regression

Summary

# find the model with glm()
# I will treat all explanatory variables as factors except for absences
m <- glm(high_use ~ sex + absences + studytime2 + goout2, data = data, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ sex + absences + studytime2 + goout2, 
##     family = "binomial", data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8789  -0.7759  -0.4636   0.7578   2.4917  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -2.55088    0.72954  -3.497 0.000471 ***
## sexM                  0.89483    0.27955   3.201 0.001370 ** 
## absences              0.07846    0.02312   3.394 0.000688 ***
## studytime22-5 hours  -0.40001    0.30551  -1.309 0.190424    
## studytime25-10 hours -0.89977    0.48249  -1.865 0.062205 .  
## studytime2>10 hours  -1.17641    0.63893  -1.841 0.065591 .  
## goout2rarely          0.44073    0.73079   0.603 0.546448    
## goout2medium          0.69593    0.71421   0.974 0.329855    
## goout2often           2.10900    0.71571   2.947 0.003212 ** 
## goout2very often      2.37031    0.73186   3.239 0.001200 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 452.04  on 369  degrees of freedom
## Residual deviance: 362.43  on 360  degrees of freedom
## AIC: 382.43
## 
## Number of Fisher Scoring iterations: 4

According to the p-value and the positive estimate, male students are significantly more likely to have high alcohol consumption (p-value = 0.0014). The number of absences was also found to be a significant predictor of high alcohol consumption (p-value = 0.0007). Meanwhile, students who spend 5 hours or more per week on studying were found to be significantly less likely to have high alcohol consumption at the 0.1 level compared to students who study less than 2 hours/week (p-value = 0.062 for students who study 5-10 hours/week and 0.066 for students who study more than 10 hours/week). Students who go out often or very often are significantly more likely to have high alcohol consumption compared to students who go out very rarely (p-value = 0.003 for students who go out often and 0.001 for students who go out very often).
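Since the coefficients are on the log-odds scale, a predicted probability for a given profile can be obtained by summing the relevant estimates and applying the inverse logit. A sketch for a hypothetical male student with 4 absences who studies less than 2 hours/week (the reference level) and goes out often, using the estimates above:

```r
# log-odds for this profile, built from the coefficients in summary(m)
log_odds <- -2.55088 +       # intercept
             0.89483 +       # sexM
             0.07846 * 4 +   # absences
             2.10900         # goout2often

# inverse logit gives the predicted probability of high alcohol use
prob <- plogis(log_odds)
round(prob, 2)   # about 0.68
```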

Odds ratio and confidence interval

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp()
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                               OR      2.5 %    97.5 %
## (Intercept)           0.07801306 0.01489892  0.281652
## sexM                  2.44691728 1.42235692  4.267696
## absences              1.08162325 1.03506042  1.134694
## studytime22-5 hours   0.67031109 0.36749404  1.221300
## studytime25-10 hours  0.40666321 0.15038986  1.014202
## studytime2>10 hours   0.30838487 0.07773944  0.998947
## goout2rarely          1.55384652 0.41639530  7.944878
## goout2medium          2.00557429 0.56021310 10.005175
## goout2often           8.23995705 2.30747880 41.326016
## goout2very often     10.70074528 2.88295245 54.872360

The table above presents the odds ratio for each variable in the model and its confidence interval.

  1. Male students are 2.45 (CI = 1.42 - 4.27) times more likely to have high alcohol consumption compared to female students. This finding supports our original hypothesis.

  2. For each increase in the number of school absences, the students are 1.08 (CI = 1.04 - 1.13) times more likely to have high alcohol consumption. This finding also supports our original hypothesis.

  3. We found only very weak evidence of a relationship between time spent on studying and high alcohol consumption. The only study-time group whose confidence interval does not cross 1 is the group of students who study for more than 10 hours a week. The students in this group are 3.24 (CI = 1.001 - 12.86) times less likely to have high alcohol consumption compared to students who study for less than 2 hours/week (I flipped the odds (1/x) for a more intuitive interpretation). Meanwhile, every other group in this variable has a confidence interval that crosses 1, which means that based on this model, it is unclear whether they are significantly less likely to have high alcohol consumption.

  4. Compared to students who very rarely go out, students who go out often are 8.24 (CI = 2.31 - 41.33) times more likely to have higher alcohol consumption. Students who go out very often are 10.7 (CI = 2.89 - 54.87) times more likely to have higher alcohol consumption. Meanwhile, it is not apparent from the model if students who go out moderately or rarely are significantly more likely to have high alcohol consumption compared to students who very rarely go out. In general, this finding goes in line with our original hypothesis.
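The odds ratios in the table are simply the exponentiated model coefficients, and the flipped odds used in point 3 follow directly:

```r
# exponentiating the sexM coefficient reproduces its odds ratio
or_sexM <- exp(0.89483)
round(or_sexM, 2)          # about 2.45

# flipping the >10 hours study-time OR for an easier interpretation
or_study10 <- 0.30838487
round(1 / or_study10, 2)   # about 3.24
```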

Exploring the predictive power

I will use all variables since they are all significant at least at the 0.1 level.

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to data
data <- mutate(data, probability = probabilities)

# use the probabilities to make a prediction of high_use
data <- mutate(data, prediction = probabilities > 0.5)

# tabulate the target variable versus the predictions
table(high_use = data$high_use, prediction = data$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   239   20
##    TRUE     55   56

Among the 294 students who are predicted not to have high alcohol consumption, 239 are classified correctly. Meanwhile, among the 76 students who are predicted to have high alcohol consumption, 56 are classified correctly. Next, we will visualize this with a plot.

#count plot
p1 <- data %>% 
  ggplot(aes(x = prediction, fill = high_use)) + 
  geom_bar() + 
  theme(legend.position = "bottom")+
  labs(x = "Model Prediction")
p1 <- p1 + scale_fill_discrete(name = "High use of alcohol")

#proportion plot
p2 <- data %>% 
  ggplot(aes(x = prediction, fill = high_use)) + 
  geom_bar(position = "fill") + 
  ylab("proportion")+ 
  theme(legend.position = "none")+
  labs(x = "Model Prediction")
 
p1+p2

From the proportion plot, we can see that most of the students predicted to have high alcohol consumption do have high alcohol consumption in the data, and vice versa. Next, we will compute the total proportion of inaccurately classified individuals.

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
train_er <- loss_func(class = data$high_use, prob = data$probability)
train_er
## [1] 0.2027027

The proportion of wrong predictions is 20.27 percent, which is a lot better than a simple guessing strategy (50 percent).
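The training error can be cross-checked against the confusion matrix above: the misclassified students are the 55 false negatives and 20 false positives out of 370:

```r
# confusion matrix counts, from the table above
wrong <- 55 + 20             # false negatives + false positives
total <- 239 + 20 + 55 + 56  # all students

error <- wrong / total
round(error, 4)   # 0.2027, matching loss_func
```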

10-fold cross validation

library(boot)
cv <- cv.glm(data = data, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
test_er <- cv$delta[1]
test_er
## [1] 0.2189189

The average proportion of wrong predictions in the cross-validation was around 22 percent, which is a little lower than that of the model in the Exercise set (26 percent).

Cross validation comparison

# model with 3 variables
m3 <- glm(high_use ~ sex + absences + goout2, data = data, family = "binomial")
probabilities3 <- predict(m3, type = "response")
data <- mutate(data, probability3 = probabilities3)
data <- mutate(data, prediction3 = probabilities3 > 0.5)
train_er3 <- loss_func(class = data$high_use, prob = data$probability3)
cv3 <- cv.glm(data = data, cost = loss_func, glmfit = m3, K = 10)
test_er3 <- cv3$delta[1]

# model with 2 variables
m2 <- glm(high_use ~ sex + absences, data = data, family = "binomial")
probabilities2 <- predict(m2, type = "response")
data <- mutate(data, probability2 = probabilities2)
data <- mutate(data, prediction2 = probabilities2 > 0.5)
train_er2 <- loss_func(class = data$high_use, prob = data$probability2)
cv2 <- cv.glm(data = data, cost = loss_func, glmfit = m2, K = 10)
test_er2 <- cv2$delta[1]

# model with 1 variable
m1 <- glm(high_use ~ absences, data = data, family = "binomial")
probabilities1 <- predict(m1, type = "response")
data <- mutate(data, probability1 = probabilities1)
data <- mutate(data, prediction1 = probabilities1 > 0.5)
train_er1 <- loss_func(class = data$high_use, prob = data$probability1)
cv1 <- cv.glm(data = data, cost = loss_func, glmfit = m1, K = 10)
test_er1 <- cv1$delta[1]

#plotting
compare <- data.frame(model = c("4 Var", "3 Var", "2 Var", "1 Var", 
                                "4 Var", "3 Var", "2 Var", "1 Var"),
                      error = c(train_er, train_er3, train_er2, train_er1,
                                test_er, test_er3, test_er2, test_er1),
                      type = c("Training Error", "Training Error",
                               "Training Error", "Training Error",
                               "Testing Error", "Testing Error",
                               "Testing Error", "Testing Error"))

compare %>% 
  ggplot(aes(x = model, y = error, group = type)) +
  geom_line(aes(colour = type))+
  scale_x_discrete(limits = rev) +
  labs(x = "Number of variables",
       y = "Error rate")+
  geom_text(
    label=round(compare$error, digits=2),
    check_overlap = T)

Generally, both testing and training errors get higher as the number of variables in the model gets lower. This is not surprising, as models with a higher number of (appropriate) variables tend to predict better. However, we see a dip in the testing error at three variables compared to the four-variable model. This may mean that the fourth variable did not improve the prediction much, so it may be better to use the three-variable model instead.


Assignment 4: Clustering and classification

date()
## [1] "Tue Nov 29 02:56:59 2022"

Setting up

# access the libraries
library(MASS)

# load the data
data("Boston")

Exploring the data

Dataset overview

dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

This week we are working with the “Housing Values in Suburbs of Boston” data. This data set describes housing values in 506 suburbs (observations) of Boston along with other related variables. In total, the data has 14 variables, which include:

Variable Name Description
crim per capita crime rate per town (numeric)
zn proportion of residential land zoned for lots over 25,000 sq.ft (numeric)
indus proportion of non-retail business acres per town (numeric)
chas Charles River dummy variable (integer, “1” if tract bounds river; “0” otherwise)
nox nitrogen oxides concentration (parts per 10 million)
rm average number of rooms per dwelling (numeric)
age proportion of owner-occupied units built prior to 1940 (numeric)
dis weighted mean of distances to five Boston employment centres (numeric)
rad index of accessibility to radial highways (integer)
tax full-value property-tax rate per $10,000 (numeric)
ptratio pupil-teacher ratio by town (numeric)
black 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town (numeric)
lstat lower status of the population (percent)
medv median value of owner-occupied homes in $1000s (numeric)

Summary

summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
  • Crime rate ranges from 0.006 to 88.97. The median (0.26) and the average (3.6) are very different, and the maximum value (88.97) is also quite far from the third quartile (3.7). This means that most of the towns in Boston have a very low crime rate, but a few towns have very high crime rates that pull up the average.

  • Similar to crime rate, the variable zn (proportion of residential land zoned for lots over 25,000 sq ft) also has a very skewed distribution. It ranges from 0 to 100, with a median of 0 and an average of 11.36. It seems that the proportion of zoned residential land is 0 for most of the towns, while some towns have a very high proportion.

  • Proportion of non-retail business acres per town (indus) ranges from 0.46 to 27.74 and averages at 11.14.

  • The variable chas (Charles River dummy variable) is categorical, so it only takes two values: 0 and 1. The very low average tells us that this variable is 0 for most of the towns in this dataset.

  • Nitrogen oxides concentration (nox) ranges from 0.385 to 0.871 and averages at 0.554.

  • The average number of rooms per dwelling (rm) ranges from 3.5 to 8.7 and averages at 6.3.

  • The proportion of owner-occupied units built prior to 1940 (age) ranges from 2.9 to 100 and averages at 68.57.

  • The weighted mean of distances to five Boston employment centres (dis) ranges from 1.13 to 12.13 and averages at 3.79.

  • Index of accessibility to radial highways (rad) ranges from 1 to 24 and averages at 9.5.

  • Full-value property-tax rate per $10,000 (tax) ranges from 187 to 711 and averages at 408.2.

  • The pupil-teacher ratio by town (ptratio) ranges from 12.6 to 22 and averages at 18.46.

  • The black variable (to be exact, 1000(Bk − 0.63)^2, where Bk is the proportion of blacks by town) ranges from 0.32 to 396.9 and averages at 356.67.

  • Lower status of the population (lstat) ranges from 1.73 to 37.97 and averages at 12.65.

  • The median value of owner-occupied homes (in thousand dollars) ranges from 5 to 50 and averages at 22.53.

Graphical overview

I will use the ggpairs() function instead of the pairs() and corrplot() functions to get a more compact result. I will also divide the variables into two graphs for a clearer visualization. We will later work with the crime variable as the dependent variable, so I will focus on it. Because we are using housing-related variables to classify crime rate, the interpretations can feel a little odd! I also found the variables quite complicated; it is hard to express them in simpler words.

#accessing the libraries
library(ggplot2)
library(GGally)
ggpairs(Boston[1:7], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)))

Confirming the observations we made in the previous section, we can see that the crime rate variable (crim) and the proportion of residential land (zn) show extremely skewed distributions. The scatter plot and the correlation coefficient show a low, negative correlation (-0.2) between crime rate and proportion of residential land (zn). A similar relationship can be seen between crime rate and average number of rooms (rm). Meanwhile, crime rate has a positive correlation with the proportion of non-retail business acres per town (indus, coef = 0.407), nitrogen oxides concentration (nox, coef = 0.421), and the proportion of owner-occupied units built prior to 1940 (age, coef = 0.353).

ggpairs(Boston, columns = c("crim", "dis", "rad", "tax", "ptratio", 
                            "black", "lstat", "medv"), mapping = aes(),
        lower = list(combo = wrap("facethist", bins = 20)))

Crime rate is negatively correlated with the weighted mean of distances to five Boston employment centres (dis, coef = -0.38), the black variable (black, coef = -0.385), and the median value of owner-occupied homes (medv, coef = -0.388). Meanwhile, it has a positive relationship with the index of accessibility to radial highways (rad, coef = 0.626), full-value property-tax rate (tax, coef = 0.583), pupil-teacher ratio (ptratio, coef = 0.29), and lower status of the population (lstat, coef = 0.456).

Wrangling

Standardize the dataset

# center and standardize variables
boston_scaled <- scale(Boston)

# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus              chas        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563   Min.   :-0.2723  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668   1st Qu.:-0.2723  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109   Median :-0.2723  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150   3rd Qu.:-0.2723  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202   Max.   : 3.6648  
##       nox                rm               age               dis         
##  Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331   Min.   :-1.2658  
##  1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366   1st Qu.:-0.8049  
##  Median :-0.1441   Median :-0.1084   Median : 0.3171   Median :-0.2790  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059   3rd Qu.: 0.6617  
##  Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164   Max.   : 3.9566  
##       rad               tax             ptratio            black        
##  Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047   Min.   :-3.9033  
##  1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876   1st Qu.: 0.2049  
##  Median :-0.5225   Median :-0.4642   Median : 0.2746   Median : 0.3808  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058   3rd Qu.: 0.4332  
##  Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372   Max.   : 0.4406  
##      lstat              medv        
##  Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 3.5453   Max.   : 2.9865

We standardize the variables to put them all on the same scale. Now that we have standardized the dataset, all variables have a mean of 0; they should also have a standard deviation of 1.
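As a quick sanity check, standardization is simply subtracting the column mean and dividing by the column standard deviation; a minimal sketch on a toy vector (not part of the assignment output):

```r
# scale() centers each column to mean 0 and rescales it to sd 1,
# equivalent to (x - mean(x)) / sd(x) column by column
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
scaled <- as.vector(scale(x))

all.equal(manual, scaled)                # TRUE
c(mean = mean(scaled), sd = sd(scaled))  # mean 0, sd 1
```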

Categorizing crime rate

We will categorize crime rate into four categories: “low”, “med_low”, “med_high”, and “high”. We will use the quartiles as the break points for the categorization. Then, we will replace the original crime rate variable (crim) with the new categorical variable (crime).

# create a quantile vector of crim to use as break points
bins <- quantile(boston_scaled$crim)

# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE,
             labels = c("low", "med_low", "med_high", "high"))

# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
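Because the break points are the quartiles, the four categories should end up roughly equal in size. A toy sketch of the same cut() logic on a small vector:

```r
# quartile breaks split the values into four (roughly) equal groups
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
bins <- quantile(v)
grp <- cut(v, breaks = bins, include.lowest = TRUE,
           labels = c("low", "med_low", "med_high", "high"))
table(grp)  # two observations per category
```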

Divide the data into train and test dataset

Now, we will divide the data into two parts: a train dataset and a test dataset. The train dataset will contain 80% of the original data, and the test dataset the remaining 20%. The train dataset will be used to build our model, while the test dataset will be used to test it.

library(dplyr)
library(MASS)
library(patchwork)
# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
# (no set.seed() here, so the split differs between runs)
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Linear discriminant analysis

Modeling

# linear discriminant analysis
lda.fit <- lda(crime ~., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2574257 0.2698020 0.2326733 0.2400990 
## 
## Group means:
##                   zn      indus        chas        nox         rm        age
## low       1.02479729 -0.9405142 -0.08304540 -0.8775489  0.4402499 -0.8596091
## med_low  -0.08777476 -0.2897367 -0.01948777 -0.5770765 -0.1074116 -0.3918435
## med_high -0.36681929  0.2184969  0.10462734  0.4772700  0.1368950  0.4300311
## high     -0.48724019  1.0172187 -0.10997442  1.0617826 -0.4826658  0.8148632
##                 dis        rad        tax     ptratio       black       lstat
## low       0.9159415 -0.6859201 -0.7479900 -0.45291375  0.37809131 -0.77322131
## med_low   0.3261444 -0.5519863 -0.4863138 -0.04514452  0.31656076 -0.18201222
## med_high -0.3867928 -0.4198554 -0.3131006 -0.38387364  0.04801848  0.02281092
## high     -0.8661575  1.6371072  1.5133254  0.77958792 -0.77376007  0.91567556
##                 medv
## low       0.52074099
## med_low   0.03573533
## med_high  0.16947542
## high     -0.72308979
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08354228  0.68628331 -0.97216749
## indus    0.05438092 -0.34463038  0.24133768
## chas    -0.12580476 -0.01056504  0.14255563
## nox      0.42432656 -0.83263416 -1.29005146
## rm      -0.09256571 -0.18032511 -0.18708537
## age      0.15162290 -0.27561298 -0.29387696
## dis     -0.04084488 -0.36879816 -0.02550853
## rad      3.37459640  0.98230172 -0.23168829
## tax      0.02050378 -0.04969296  0.62726128
## ptratio  0.11192561  0.02876548 -0.18175195
## black   -0.11791282  0.03922974  0.12235154
## lstat    0.21939091 -0.19404543  0.35302481
## medv     0.18099343 -0.39363635 -0.13849704
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9484 0.0371 0.0146

For each observation in the dataset (for each town in Boston), LDA computes the probability of belonging to each crime rate group. Then, each observation is allocated to the group with the highest probability.

  • Prior probabilities of groups show the proportion of each group in the data. Since we divided the data using quartiles and the training set was chosen randomly, the prior probabilities of the groups should not differ much.

  • Group means show the average of each variable in each group. They are a little hard to interpret directly since the variables are standardized, but we can see that some variables show a descending or ascending trend. For example, the average of the rad variable (index of accessibility to radial highways) gets higher in groups with a higher crime rate.

  • Coefficients of linear discriminants show the coefficients of the first, second, and third discriminant functions. These functions are used to discriminate between the groups.

  • Proportion of trace is the percentage of separation achieved by each discriminant function; here, the first discriminant function achieves about 95 percent of the separation.
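This allocation rule can be verified directly: predict() on an lda fit returns the per-class posterior probabilities, and the predicted class is the one with the highest posterior. A sketch using the built-in iris data rather than our Boston model:

```r
library(MASS)

# fit an LDA on iris and inspect the posterior probabilities
fit_iris <- lda(Species ~ ., data = iris)
pred <- predict(fit_iris)

# the posteriors sum to 1 for every observation...
all(abs(rowSums(pred$posterior) - 1) < 1e-12)  # TRUE

# ...and the predicted class is the column with the highest posterior
picked <- colnames(pred$posterior)[max.col(pred$posterior)]
all(picked == as.character(pred$class))        # TRUE
```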

# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)

The x axis shows the score from the first discriminant function (LD1) and the y axis the score from the second discriminant function (LD2). There is some overlap between the groups. If we calculate a town’s discriminant score using the first discriminant function (LD1) and get a high LD1 score (above 4), that town would most probably be categorized as having a “high” crime rate. The arrows represent the predictor variables; a longer arrow means larger coefficients in the discriminant functions (the arrows are scaled by a factor of 2 so we can see them better). We can see that the rad variable (index of accessibility to radial highways) is the most discriminating variable, and it has positive coefficients in both the first (LD1) and second (LD2) discriminant functions.

Testing

We will now predict the crime rate classification in our testing set with our LDA model. Then, we will compare the result to the real classification.

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10      11        2    0
##   med_low    0      16        1    0
##   med_high   0      16       14    2
##   high       0       0        0   30

Here, we have the cross tabulation of the real and predicted categories. Ideally, every observation would fall in its own group (towns with a low crime rate should be predicted to have a low crime rate). Most of the towns are classified correctly, but some are not. We can count the number of correctly classified observations by summing the diagonal (upper left to bottom right) of the cross tabulation; the observations off the diagonal are misclassified.
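For example, the share of correctly classified observations is the diagonal sum divided by the table total; a toy sketch of the calculation (applied to the table above, it would give the share of correctly classified towns):

```r
# accuracy = correctly classified / total = sum of the diagonal / table total
tab <- table(correct   = c("a", "a", "b", "b"),
             predicted = c("a", "b", "b", "b"))
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 0.75
```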

K-means algorithm

Finding the optimal number of clusters

#reloading Boston dataset
data("Boston")

#standardize
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)


# euclidean distance matrix
dist_eu <- dist(boston_scaled)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.

We need to look for the point where the WCSS value drops most sharply. In this graph, the optimal number of clusters seems to be 2.
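The elbow logic can be illustrated on toy data with two obvious clusters: the WCSS collapses when going from one to two clusters and flattens afterwards (a sketch, not part of the assignment output):

```r
set.seed(1)
# two well-separated point clouds around (0, 0) and (5, 5)
pts <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))

# total within-cluster sum of squares for k = 1..5
twcss <- sapply(1:5, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)

# the biggest drop is between k = 1 and k = 2, so 2 is the elbow
which.min(diff(twcss))  # 1, i.e. the drop from k = 1 to k = 2
```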

Run the algorithm with the optimal number of clusters

# k-means clustering
km <- kmeans(boston_scaled, centers = 2)

# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)

We divided the data into two clusters, shown in different colors in the graph. For some variables the clusters are defined quite clearly, while for others the difference is less clear. For example, if we look at the plots involving the crim (crime rate) variable, it is quite clear that towns with a low crime rate are clustered together. Meanwhile, the clusters are less clear for the dis (weighted mean of distances to five Boston employment centres) variable: although the red cluster tends to lie in the lower ranges, the separation is not very clear.

Bonus k-means

Creating the clusters

#reloading Boston dataset
data("Boston")

#standardize
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)

# k-means clustering
km <- kmeans(boston_scaled, centers = 4)

#extracting the result
boston_scaled <- cbind(boston_scaled, cluster = km$cluster)

LDA

# linear discriminant analysis
fit <- lda(cluster ~., data = boston_scaled)

lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(boston_scaled$cluster)

# plot the lda results
plot(fit, dimen = 2, col = classes, pch = classes)
lda.arrows(fit, myscale = 3)

There is some overlap, but in general the data is separated quite clearly. The chas variable (Charles River dummy variable) seems to be the most influential separator among the variables in the dataset. Other influential separators are rad (index of accessibility to radial highways) and dis (weighted mean of distances to five Boston employment centres).

Super bonus

Create the matrix product

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)

Create 3D plot

library(plotly)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
model_predictors <- dplyr::select(boston_scaled, -cluster)
# check the dimensions
dim(model_predictors)
## [1] 506  14
dim(fit$scaling)
## [1] 14  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% fit$scaling
matrix_product <- as.data.frame(matrix_product)
#plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = boston_scaled$cluster)

The two plots differ because the coloring is different: the first plot colors the points by crime rate category, while the second colors them by k-means cluster. The separation in the second plot is much clearer, probably because the clusters were built from the data itself. However, in both plots the data seems to separate into two large groups.